URL Frontier


Robots.txt - Exclusion


• Protocol for giving spiders ("robots") limited access to a website
  • Source: http://www.robotstxt.org/wc/norobots.html
• Website announces what is okay and not okay to crawl:
  • Located at http://www.myurl.com/robots.txt
  • This file holds the restrictions

Robots.txt Example 


• http://www.ics.uci.edu/robots.txt


User-agent: MOMspider           # The Multi-Owner Maintenance Spider
Disallow: /cgi-bin/             # Script files
Disallow: /Admin/MOM/           # Local MOMspider output
Disallow: /~fielding/MOM/       # Local MOMspider output
Disallow: /TR/                  # Dienst Technical Report Server
Disallow: /Server/              # Dienst Technical Report Server
Disallow: /Document/            # Dienst Technical Report Server
Disallow: /MetaServer/          # Dienst Technical Report Server
Disallow: /~eppstein/pubs/cites/ # Eppstein Database
Disallow: /~fiorello/pvt/       # Private pages

User-agent: *                   # All other spiders should avoid
Disallow: /cgi-bin/             # Script files
Disallow: /Test/                # The test area for web experimentation
Disallow: /Admin/               # Huge server statistic logs
Disallow: /TR/                  # Dienst Technical Report Server
Disallow: /Server/              # Dienst Technical Report Server
Disallow: /Document/            # Dienst Technical Report Server
Disallow: /MetaServer/          # Dienst Technical Report Server
Disallow: /~fielding/MOM/       # Local MOMspider output
Disallow: /~kanderso/hidden     # Ken Anderson's stuff
Disallow: /~eppstein/pubs/cites/ # Eppstein Database
Disallow: /~fiorello/pvt/       # Private pages
Disallow: /~dean/
Disallow: /~wwwoffic/
Disallow: /~ucounsel/
Disallow: /~sao/
Disallow: /~support/
Disallow: /~icsdb/
Disallow: /bin/
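A crawler usually checks rules like these before every fetch. The following is a minimal sketch using Python's standard `urllib.robotparser`, fed a shortened copy of the rules so that it runs without network access. The crawler name `ExampleBot` is made up for illustration; it is not part of the original file.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules from the example file.
RULES = """\
User-agent: MOMspider
Disallow: /cgi-bin/
Disallow: /Admin/MOM/

User-agent: *
Disallow: /cgi-bin/
Disallow: /Test/
Disallow: /Admin/
Disallow: /~kanderso/hidden
"""


def build_parser(text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots = build_parser(RULES)
BASE = "http://www.ics.uci.edu"

# "ExampleBot" is a hypothetical crawler; it falls under "User-agent: *".
print(robots.can_fetch("ExampleBot", BASE + "/Test/page.html"))  # blocked by /Test/
print(robots.can_fetch("ExampleBot", BASE + "/about/"))          # no rule matches
# MOMspider has its own section, which does not list /Test/.
print(robots.can_fetch("MOMspider", BASE + "/Test/page.html"))
```

Note that a named section replaces the `*` section for that spider rather than adding to it, which is why the example file repeats several rules in both sections.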


URL Frontier 


Sitemaps - Inclusion
• https://www.google.com/webmasters/tools/docs/en/protocol.html#sitemapXMLExample


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>
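To seed a frontier from a sitemap, a crawler has to read these entries, including the XML namespace. The sketch below uses the standard library's `xml.etree.ElementTree` on a two-entry excerpt of the example. The helper name `parse_sitemap` is an assumption for illustration; the fallback priority of 0.5 is the default stated by the sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by every element in a sitemap file.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; optional fields may be None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        priority = url.findtext("sm:priority", default="0.5", namespaces=SITEMAP_NS)
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            "priority": float(priority),
        })
    return entries


entries = parse_sitemap(SITEMAP_XML)
for entry in entries:
    print(entry["priority"], entry["loc"])
```

The XML parser turns `&amp;` back into `&`, so the `loc` values come out as ordinary URLs ready for the frontier.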


Web Crawling Outline 


Overview 
• Introduction
• URL Frontier
• Robust Crawling
  • DNS


Robust Crawling 


A Robust Crawl Architecture 
Robust Crawling


Processing Steps in Crawling 
• Pick a URL from the frontier (how to prioritize?)
• Fetch the document (DNS lookup)
• Parse the fetched document
  • Extract links
• Check for duplicate content
  • If it is not a duplicate, add it to the index
• For each extracted link
  • Make sure it passes the filter (robots.txt)
  • Make sure it isn't already in the URL frontier
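The steps above can be sketched as a single loop. This is a simplified illustration under stated assumptions, not a production crawler: `fetch` and `allowed` are hypothetical callables supplied by the caller (so the sketch runs against an in-memory "web"), the frontier is a plain FIFO queue with no prioritization, links are extracted with a naive regular expression, and duplicate detection is an exact hash of the page body.

```python
import hashlib
import re
from collections import deque
from urllib.parse import urljoin

HREF = re.compile(r'href="([^"]+)"')  # naive link extractor, enough for a sketch


def crawl(seeds, fetch, allowed, max_pages=100):
    """fetch(url) -> html or None; allowed(url) -> bool (e.g. a robots.txt check)."""
    frontier = deque(seeds)   # FIFO here; a real frontier would prioritize
    queued = set(seeds)       # every URL ever sent to the frontier
    seen_content = set()      # fingerprints of page bodies already indexed
    index = {}                # url -> page text

    while frontier and len(index) < max_pages:
        url = frontier.popleft()                 # pick a URL from the frontier
        html = fetch(url)                        # fetch (DNS lookup happens in here)
        if html is None:
            continue
        fingerprint = hashlib.sha1(html.encode("utf-8")).hexdigest()
        if fingerprint in seen_content:          # duplicate content: stop processing
            continue
        seen_content.add(fingerprint)
        index[url] = html                        # not a duplicate: add to index
        for href in HREF.findall(html):          # parse and extract links
            link = urljoin(url, href)
            if allowed(link) and link not in queued:
                queued.add(link)
                frontier.append(link)
    return index


# A tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a.test/": '<a href="/x">x</a> <a href="/private/y">y</a>',
    "http://a.test/x": '<a href="/">home</a>',
}
index = crawl(["http://a.test/"], PAGES.get, lambda u: "/private/" not in u)
print(sorted(index))
```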


Domain Name Server 


• A lookup service on the internet
  • Given the host name in a URL, retrieve its IP address
  • www.djp3.net -> 69.17.116.124
• This service is provided by a distributed set of servers
  • Latency can be high
    • Even seconds


Domain Name Server 


• Common OS implementations of DNS lookup are blocking
  • One request at a time
• Solution:
  • Caching
  • Batch requests
  • Custom resolvers
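As a rough illustration of the first two remedies, the sketch below wraps a blocking lookup in a cache and resolves a batch of hosts on a thread pool so that the blocking calls overlap. The class name `CachingResolver`, the fixed one-hour lifetime, and the worker count are assumptions for illustration; a real resolver would honor the TTL returned with each DNS record.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


class CachingResolver:
    """Cache around a blocking lookup function (socket.gethostbyname by default)."""

    def __init__(self, lookup=socket.gethostbyname, ttl_seconds=3600.0, workers=16):
        self._lookup = lookup
        self._ttl = ttl_seconds    # assumed fixed lifetime, not the record's real TTL
        self._workers = workers
        self._cache = {}           # host -> (ip, expiry time)

    def resolve(self, host):
        now = time.monotonic()
        entry = self._cache.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]                    # cache hit: no network round trip
        ip = self._lookup(host)                # blocking call
        self._cache[host] = (ip, now + self._ttl)
        return ip

    def resolve_batch(self, hosts):
        """Resolve many hosts concurrently; duplicates are looked up once."""
        unique = list(dict.fromkeys(hosts))
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            ips = list(pool.map(self.resolve, unique))
        return dict(zip(unique, ips))
```

A crawler could call `resolve_batch` on the hosts of the next few hundred frontier URLs before fetching, so later calls to `resolve` are served from the cache.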


dig +trace www.djp3.net


• Client: "Where is www.djp3.net?"
• Root name server, {A}.ROOT-SERVERS.NET = 198.41.0.4: "Ask 192.5.6.30"
• .net name server, {A}.GTLD-SERVERS.net = 192.5.6.30: "Ask 72.1.140.145"
• djp3.net name server, {ns1}.speakeasy.net = 72.1.140.145: "Use 69.17.116.124"
• Client to www.djp3.net = 69.17.116.124: "Give me a web page"


What really happens


• The User (photo: flickr:crankyT) asks the client for www.djp3.net
• Client: Firefox, with the Firefox DNS cache
• OS DNS resolver, with the OS DNS cache and host table
• OS-specified DNS server (ns1.ics.uci.edu), with its own DNS cache
• Name servers:
  • {A}.ROOT-SERVERS.NET = 198.41.0.4
  • {A}.GTLD-SERVERS.net = 192.5.6.30
  • {ns1}.speakeasy.net = 72.1.140.145
  • www.djp3.net = 69.17.116.124


Class Exercise
• Calculate how long it would take to completely fill a DNS cache.
  • How many active hosts are there?
  • What is an average lookup time?
  • Do the math.
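One way to set up the calculation is shown below. Both inputs are placeholder assumptions chosen only to make the arithmetic concrete (one billion active hosts and 100 ms per lookup); they are not figures from the lecture, and the point of the exercise is to find better estimates.

```python
def seconds_to_fill_cache(active_hosts, avg_lookup_seconds, parallel_lookups=1):
    """Time to look up every host once, with lookups spread evenly over workers."""
    return active_hosts * avg_lookup_seconds / parallel_lookups


# Placeholder assumptions, NOT measured values.
ASSUMED_HOSTS = 1_000_000_000
ASSUMED_LOOKUP_SECONDS = 0.1

sequential = seconds_to_fill_cache(ASSUMED_HOSTS, ASSUMED_LOOKUP_SECONDS)
parallel = seconds_to_fill_cache(ASSUMED_HOSTS, ASSUMED_LOOKUP_SECONDS,
                                 parallel_lookups=1000)

print("one lookup at a time: %.0f days" % (sequential / 86_400))
print("1000 at a time:       %.1f hours" % (parallel / 3_600))
```

Under these assumptions a blocking, one-at-a-time resolver would need roughly three years, while 1000 concurrent lookups bring that down to a little over a day, which is the motivation for batching and custom resolvers.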


Why run a DNS lookup service?


• Examples: Google Public DNS, OpenDNS, zoneedit
  • What is Google Public DNS? "Google Public DNS is a free, global Domain N[...] alternative to your current DNS provider."
• It's your administrative domain
• A public good
• It helps your other business
• You can make money on bad queries
• Mobile servers need special
Robust Crawling


A Robust Crawl Architecture 
URL Frontier Queue


Parsing: URL normalization 


• When a fetched document is parsed
  • some outlink URLs are relative
  • For example:
    • http://en.wikipedia.org/wiki/Main_Page
    • has a link to "/wiki/Special:Statistics"
    • which is the same as
    • http://en.wikipedia.org/wiki/Special:Statistics
• Parsing involves normalizing (expanding) relative URLs


Parsing: URL normalization 


• When a fetched document is parsed
  • some outlink URLs are protocol-relative
  • For example:
    • http://www.starbucks.com/
    • has a <script src="//cdn.optimizely.com/js/6558036.js"></script>
    • which matches the protocol used to load it
    • "http:" or "https:" or "file:" //cdn.optimizely.com/js/6558036.js
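Both kinds of normalization, relative and protocol-relative, can be handled by resolving each link against the URL of the page it was found on. A minimal sketch with the standard library's `urllib.parse.urljoin` follows; the wrapper name `normalize` is an assumption for illustration.

```python
from urllib.parse import urljoin


def normalize(page_url, href):
    """Expand a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Relative link: scheme and host come from the page.
print(normalize("http://en.wikipedia.org/wiki/Main_Page",
                "/wiki/Special:Statistics"))

# Protocol-relative link: only the scheme comes from the page.
print(normalize("http://www.starbucks.com/",
                "//cdn.optimizely.com/js/6558036.js"))
print(normalize("https://www.starbucks.com/",
                "//cdn.optimizely.com/js/6558036.js"))

# Already absolute: returned unchanged.
print(normalize("http://www.starbucks.com/", "http://www.example.com/a"))
```

Normalizing before the duplicate check matters because the frontier compares URLs as strings, so two spellings of the same address would otherwise be crawled twice.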


Robust Crawling 


A Robust Crawl Architecture 
URL Frontier Queue


Duplication 


Content Seen? 


• Duplication is widespread on the web
• If a page just fetched is already in the index, don't process it any further
• This can be done by using document fingerprints/shingles
  • A type of approximate hashing scheme
  • Similar to watermarking, SIFT features, etc.
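To make the idea of shingles concrete, the sketch below breaks each document into overlapping word k-grams and compares the resulting sets with Jaccard similarity. The shingle size `k=3` and the threshold `0.7` are illustrative assumptions; a large-scale system would typically hash the shingles and compare compact sketches rather than full sets.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences ("shingles") in the text."""
    words = text.lower().split()
    count = max(len(words) - k + 1, 1)
    return {" ".join(words[i:i + k]) for i in range(count)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate(doc_a, doc_b, threshold=0.7):
    """True when the two documents share most of their shingles."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


page_1 = "the quick brown fox jumps over the lazy dog"
page_2 = "the quick brown fox jumps over the lazy cat"
page_3 = "web crawlers must be polite to the servers they visit"

print(jaccard(shingles(page_1), shingles(page_2)))  # 6 shared of 8 distinct
print(near_duplicate(page_1, page_2))
print(near_duplicate(page_1, page_3))
```

Unlike an exact hash of the page, this measure changes only slightly when a page differs by a word or two, which is what lets a crawler catch mirrored pages with different timestamps or footers.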


Robust Crawling 


A Robust Crawl Architecture 
Compliance with webmasters' wishes...


• Robots.txt
  • A filter is a pattern (e.g., a regular expression) for URLs to be excluded
  • How often do you check robots.txt?
    • Cache it to avoid using bandwidth and loading the web server
• Sitemaps
  • A mechanism to better manage the URL frontier


Robust Crawling 


A Robust Crawl Architecture 
Duplicate Elimination


• For a one-time crawl
  • Test to see if an extracted, parsed, filtered URL
    • has already been sent to the frontier.
    • has already been indexed.
• For a continuous crawl
  • See full frontier implementation:
  • Update the URL's priority
    • Based on staleness
    • Based on quality
    • Based on politeness
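For the continuous case, a URL that is seen again is not dropped but has its priority updated. The sketch below shows one possible shape for this: a heap-based frontier with lazy deletion of outdated entries. The class name `Frontier`, the function `priority`, and the scoring formula itself (quality times staleness, zeroed when the host was contacted too recently) are illustrative assumptions, not the frontier design from the lecture.

```python
import heapq
import itertools


def priority(days_since_fetch, quality, seconds_since_host_hit, min_delay=10.0):
    """Assumed scoring: stale, high-quality pages first; wait on busy hosts."""
    if seconds_since_host_hit < min_delay:
        return 0.0                      # politeness: do not hit the host again yet
    return quality * days_since_fetch   # staleness weighted by quality


class Frontier:
    """Max-priority queue of URLs that supports updating a URL's priority."""

    def __init__(self):
        self._heap = []
        self._current = {}              # url -> its latest priority
        self._order = itertools.count() # tie-breaker for equal priorities

    def add_or_update(self, url, prio):
        self._current[url] = prio
        heapq.heappush(self._heap, (-prio, next(self._order), url))

    def pop(self):
        """Return the highest-priority URL, skipping outdated heap entries."""
        while self._heap:
            neg_prio, _, url = heapq.heappop(self._heap)
            if self._current.get(url) == -neg_prio:
                del self._current[url]
                return url
        return None


frontier = Frontier()
frontier.add_or_update("http://a.test/", priority(1, 0.5, 60))
frontier.add_or_update("http://b.test/", priority(2, 0.9, 60))
frontier.add_or_update("http://a.test/", priority(30, 0.5, 60))  # a.test went stale
first, second, third = frontier.pop(), frontier.pop(), frontier.pop()
print(first, second, third)
```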


Distributing the crawl 


• The key goal for the architecture of a distributed crawl is cache locality
• We want multiple crawl threads in multiple processes at multiple nodes for robustness
  • Geographically distributed for speed
• Partition the hosts being crawled across nodes
  • Hash typically used for partition
• How do the nodes communicate?
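Hash partitioning by host can be sketched in a few lines. The function name `node_for` and the choice of SHA-1 are assumptions for illustration; any hash that gives the same result on every node would do. Python's built-in `hash()` is avoided here because it is randomized per process for strings, so two nodes could disagree about who owns a host.

```python
import hashlib
from urllib.parse import urlsplit


def node_for(url, num_nodes):
    """Map a URL to a node index in [0, num_nodes) based on its host name."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


urls = [
    "http://en.wikipedia.org/wiki/Main_Page",
    "http://en.wikipedia.org/wiki/Special:Statistics",
    "http://www.ics.uci.edu/robots.txt",
]
for url in urls:
    print(node_for(url, 4), url)
```

Because the key is the host rather than the full URL, every page of a site lands on the same node, which keeps that node's DNS cache, robots.txt cache, and politeness timers local to the host they describe.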


Robust Crawling 


The output of the URL Filter at each node is sent to the Duplicate
Eliminator at all other nodes


[Figure: distributed crawl architecture. Components: DNS, Fetch, Parse, Content Seen? (fingerprints), URL Filter, Host Splitter (to other nodes / from other nodes), Duplicate Elimination (URL index)]